Abstract. We state the problem of inverse reinforcement learning in 
terms of preference elicitation, resulting in a principled (Bayesian) sta- 
tistical formulation. This generalises previous work on Bayesian inverse 
reinforcement learning and allows us to obtain a posterior distribution 
on the agent's preferences, policy and optionally, the obtained reward 
sequence, from observations. We examine the relation of the resulting 
approach to other statistical methods for inverse reinforcement learning 
via analysis and experimental results. We show that preferences can be 
determined accurately, even if the observed agent's policy is sub-optimal 
with respect to its own preferences. In that case, significantly improved 
policies with respect to the agent's preferences are obtained, compared to 
both other methods and to the performance of the demonstrated policy. 

Key words: Inverse reinforcement learning, preference elicitation, de- 
cision theory, Bayesian inference 



1 Introduction 

Preference elicitation is a well-known problem in statistical decision theory [TU] . 
The goal is to determine whether a given decision maker prefers some events to 
other events, and if so, by how much. The first main assumption is that there 
exists a partial ordering among events, indicating relative preferences. Then 
the corresponding problem is to determine which events are preferred to which 
others. The second main assumption is the expected utility hypothesis. This 
posits that if we can assign a numerical utility to each event, such that events 
with larger utilities are preferred, then the decision maker's preferred choice from 
a set of possible gambles will be the gamble with the highest expected utility. The 
corresponding problem is to determine the numerical utilities for a given decision 
maker. 

Preference elicitation is also of relevance to cognitive science and behavioural 
psychology, e.g. for determining rewards implicit in behaviour [19], where a proper 
elicitation procedure may allow one to reach more robust experimental conclusions. 
There are also direct practical applications, such as user modelling for 
determining customer preferences [3]. Finally, by analysing the apparent preferences 
of an expert while performing a particular task, we may be able to discover 
behaviours that match or even surpass the performance of the expert [1] in the 
very same task. 

This paper uses the formal setting of preference elicitation to determine the 
preferences of an agent acting within a discrete-time stochastic environment. We 
assume that the agent obtains a sequence of (hidden to us) rewards from the en- 
vironment and that its preferences have a functional form related to the rewards. 
We also suppose that the agent is acting nearly optimally (in a manner to be 
made more rigorous later) with respect to its preferences. Armed with this infor- 
mation, and observations from the agent's interaction with the environment, we 
can determine the agent's preferences and policy in a Bayesian framework. This 
allows us to generalise previous Bayesian approaches to inverse reinforcement 
learning. 

In order to do so, we define a structured prior on reward functions and 
policies. We then derive two different Markov chain procedures for preference 
elicitation. The result of the inference is used to obtain policies that are signifi- 
cantly improved with respect to the true preferences of the observed agent. We 
show that this can be achieved even with fairly generic sampling approaches. 

Numerous other inverse reinforcement learning approaches exist [1, 18, 20, 21]. 
Our main contribution is to provide a clear Bayesian formulation of inverse 
reinforcement learning as preference elicitation, with a structured prior on the 
agent's utilities and policies. This generalises the approach of Ramachandran 
and Amir [18] and paves the way to principled procedures for determining 
distributions on reward functions, policies and reward sequences. Performance-wise, 
we show that the policies obtained through our methodology easily surpass the 
agent's actual policy with respect to its own utility. Furthermore, we obtain 
policies that are significantly better than those obtained with the other inverse 
reinforcement learning methods that we compare against. 

Finally, the relation to experimental design for preference elicitation (see [3] 
for example) must be pointed out. Although this is a very interesting planning 
problem, in this paper we do not deal with active preference elicitation. We 
focus on the sub-problem of estimating preferences given a particular observed 
behaviour in a given environment and use decision theoretic formalisms to derive 
efficient procedures for inverse reinforcement learning. 

This paper is organised as follows. The next section formalises the preference 
elicitation setting and relates it to inverse reinforcement learning. Section 3 
presents the abstract statistical model used for estimating the agent's preferences. 
Section 4 describes a model and inference procedure for joint estimation 
of the agent's preferences and its policy. Section 5 discusses related work in more 
detail. Section 6 presents comparative experiments, which quantitatively examine 
the quality of the solutions in terms of both preference elicitation and the 
estimation of improved policies. We conclude in Section 7 with a view to further extensions. 
2 Formalisation of the problem 

We separate the agent's preferences (which are unknown to us) from the environment's 
dynamics (which we consider known). More specifically, the environment 
is a controlled Markov process $\nu = (\mathcal{S}, \mathcal{A}, \mathcal{T})$,^ with state space $\mathcal{S}$, action space 
$\mathcal{A}$ and transition kernel $\mathcal{T} = \{\tau(\cdot \mid s, a) : s \in \mathcal{S}, a \in \mathcal{A}\}$, indexed in $\mathcal{S} \times \mathcal{A}$ such 
that $\tau(\cdot \mid s, a)$ is a probability measure on $\mathcal{S}$. The dynamics of the environment 
are Markovian: if at time $t$ the environment is in state $s_t \in \mathcal{S}$ and the agent 
performs action $a_t \in \mathcal{A}$, then the next state $s_{t+1}$ is drawn with a probability 
independent of previous states and actions: 

\[ \mathbb{P}_\nu(s_{t+1} \in S \mid s^t, a^t) = \tau(S \mid s_t, a_t), \qquad S \subset \mathcal{S}, \tag{2.1} \]

where we use the convention $s^t \equiv s_1, \ldots, s_t$ and $a^t \equiv a_1, \ldots, a_t$ to represent 
sequences of variables. 
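
As an illustration of the Markovian dynamics, the following minimal Python sketch samples a state-action sequence from a finite controlled Markov process under a fixed reactive policy. The nested-list layout of `transition` and `policy`, and all function and variable names, are assumptions made for this example only.

```python
import random


def sample_trajectory(transition, policy, initial_state, horizon, rng):
    """Sample (states, actions) of length `horizon`.

    transition[s][a] lists next-state probabilities (the kernel tau(. | s, a));
    policy[s] lists action probabilities. Both layouts are assumptions.
    """
    states, actions = [], []
    s = initial_state
    for _ in range(horizon):
        a = rng.choices(range(len(policy[s])), weights=policy[s])[0]
        states.append(s)
        actions.append(a)
        # Markov property: the next state depends only on the current (s, a).
        s = rng.choices(range(len(transition[s][a])), weights=transition[s][a])[0]
    return states, actions


# Two states, two actions: action 0 stays put, action 1 switches state.
transition = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]]
policy = [[0.0, 1.0], [0.0, 1.0]]  # a policy that always switches
states, actions = sample_trajectory(transition, policy, 0, 4, random.Random(0))
```

Because both the toy kernel and the toy policy are deterministic here, the sampled sequence simply alternates between the two states.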

In our setting, we have observed the agent acting in the environment and 
obtained a sequence of actions and a sequence of states: 

\[ D \triangleq (a^T, s^T), \qquad a^T \equiv a_1, \ldots, a_T, \qquad s^T \equiv s_1, \ldots, s_T. \]

The agent has an unknown utility function, $U_t$, according to which it selects 
actions, and which we wish to discover. Here, we assume that $U_t$ has a structure 
corresponding to that of infinite-horizon discounted-reward reinforcement learning 
problems and that the agent tries to maximise the expected utility. 

Assumption 1 The agent's utility at time $t$ is the total $\gamma$-discounted return 
from time $t$: 

\[ U_t \triangleq \sum_{k=t}^{\infty} \gamma^k r_k, \tag{2.2} \]

where $\gamma \in [0, 1]$ is a discount factor, and the reward $r_t$ is given by the (stochastic) 
reward function $\rho$ so that $r_t \mid s_t = s, a_t = a \sim \rho(\cdot \mid s, a)$, $(s, a) \in \mathcal{S} \times \mathcal{A}$. 

This choice establishes correspondence with the standard reinforcement learning 
setting.* The controlled Markov process and the utility define a Markov decision 
process [IS] (MDP), denoted by $\mu = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \rho, \gamma)$. The agent uses some policy 
$\pi$ to select actions with distribution $\pi(a_t \mid s_t)$, which together with the Markov 
decision process $\mu$ defines a Markov chain on the sequence of states, such that: 

\[ \mathbb{P}_{\mu,\pi}(s_{t+1} \in S \mid s^t) = \int_{\mathcal{A}} \tau(S \mid a, s_t) \, d\pi(a \mid s_t), \tag{2.3} \]

^ We assume the measurability of all sets with respect to some appropriate $\sigma$-algebra. 

* In our framework, this is only one of the many possible assumptions regarding the 
form of the utility function. As an alternative example, consider an agent who collects 
gold coins in a maze with traps, and with a utility equal to the logarithm of the 
number of coins if it exits the maze, and zero otherwise. 




Constantin A. Rothkopf and Christos Dimitrakakis 



where we use a subscript to denote that the probability is taken with respect 
to the process defined jointly by $\mu, \pi$. We shall use this notational convention 
throughout this paper. Similarly, the expected utility of a policy $\pi$ is denoted by 
$\mathbb{E}_{\mu,\pi} U_t$. We also introduce the family of Q-value functions $\{Q^\pi_\mu : \mu \in \mathcal{M}, \pi \in \mathcal{P}\}$, 
where $\mathcal{M}$ is a set of MDPs, with $Q^\pi_\mu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ such that: 

\[ Q^\pi_\mu(s, a) \triangleq \mathbb{E}_{\mu,\pi}(U_t \mid s_t = s, a_t = a). \tag{2.4} \]

Finally, we use $Q^*_\mu$ to denote the optimal Q-value function for an MDP $\mu$, such 
that: 

\[ Q^*_\mu(s, a) = \sup_{\pi \in \mathcal{P}} Q^\pi_\mu(s, a), \qquad \forall s \in \mathcal{S}, a \in \mathcal{A}. \tag{2.5} \]

With a slight abuse of notation, we shall use $Q_\rho$ when we only need to distinguish 
between different reward functions $\rho$, as long as the remaining components of 
$\mu$ remain fixed. 
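
For a finite MDP with a discount factor strictly below one, the optimal Q-value function can be approximated by standard value iteration. The sketch below is only illustrative: it assumes the usual Bellman form with expected rewards `reward[s][a]` and discounting relative to the current step, and the function name `optimal_q` and the data layout are hypothetical.

```python
def optimal_q(transition, reward, gamma, tol=1e-10, max_iter=10000):
    """Approximate Q* by value iteration (requires gamma < 1).

    transition[s][a][s2] is a next-state probability and reward[s][a] an
    expected reward; both layouts are assumptions of this sketch.
    """
    n_states, n_actions = len(reward), len(reward[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        v = [max(row) for row in q]  # greedy state values
        new_q = [[reward[s][a]
                  + gamma * sum(p * v[s2] for s2, p in enumerate(transition[s][a]))
                  for a in range(n_actions)] for s in range(n_states)]
        delta = max(abs(new_q[s][a] - q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        q = new_q
        if delta < tol:
            break
    return q


# Action 0 stays, action 1 switches; any action taken in state 1 pays reward 1.
transition = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]]
reward = [[0.0, 0.0], [1.0, 1.0]]
q_star = optimal_q(transition, reward, gamma=0.5)
```

In this toy problem the optimal behaviour is to move to state 1 and stay there, so the values can be checked by hand: staying in state 1 is worth 1/(1 - 0.5) = 2.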

Loosely speaking, our problem is to estimate the reward function $\rho$ and discount 
factor $\gamma$ that the agent uses, given the observations $s^T, a^T$ and some prior 
beliefs. As shall be seen in the sequel, this task is easier with additional assumptions 
on the structural form of the policy $\pi$. We derive two sampling algorithms. 
The first estimates a joint posterior distribution on the policy and reward function, 
while the second also estimates a distribution on the sequence of rewards 
that the agent obtains. We then show how to use those estimates in order to 
obtain a policy that can perform significantly better than the agent's 
original policy with respect to the agent's true preferences. 



3 The statistical model 

In the simplest version of the problem, we assume that $\gamma, \nu$ are known and we 
only estimate the reward function, given some prior over reward functions and 
policies. This assumption can be easily relaxed, via an additional prior on the 
discount factor $\gamma$ and CMP $\nu$. Let $\mathcal{R}$ be a space of reward functions $\rho$ and $\mathcal{P}$ 
be a space of policies $\pi$. We define a (prior) probability measure $\xi(\cdot \mid \nu)$ on 
$\mathcal{R}$ such that for any $B \subset \mathcal{R}$, $\xi(B \mid \nu)$ corresponds to our prior belief that the 
reward function is in $B$. Finally, for any reward function $\rho \in \mathcal{R}$, we define a 
conditional probability measure $\psi(\cdot \mid \rho, \nu)$ on the space of policies $\mathcal{P}$. Let $\rho_a, \pi_a$ 
denote the agent's true reward function and policy respectively. The joint prior 
on reward functions and policies is denoted by: 

\[ \phi(P, R \mid \nu) \triangleq \int_R \psi(P \mid \rho, \nu) \, d\xi(\rho \mid \nu), \qquad P \subset \mathcal{P}, R \subset \mathcal{R}, \tag{3.1} \]

such that $\phi(\cdot \mid \nu)$ is a probability measure on $\mathcal{R} \times \mathcal{P}$. We define two models, 
depicted in Figure 1. The basic model, shown in Figure 1(a), is defined as follows: 

\[ \rho \sim \xi(\cdot \mid \nu), \qquad \pi \mid \rho_a = \rho \sim \psi(\cdot \mid \rho, \nu). \]
(a) Basic model (b) Reward-augmented model 

Fig. 1. Graphical model, with reward priors $\xi$ and policy priors $\psi$, while $\rho$ and $\pi$ 
are the reward and policy, where we observe the demonstration $D$. Dark colours 
denote observed variables and light colours denote latent variables. The implicit 
dependencies on $\nu$ are omitted for clarity. 



We also introduce a reward-augmented model, where we explicitly model the 
rewards obtained by the agent, as shown in Figure 1(b): 

\[ \rho \sim \xi(\cdot \mid \nu), \qquad \pi \mid \rho_a = \rho \sim \psi(\cdot \mid \rho, \nu), \qquad r_t \mid \rho_a = \rho, s_t = s, a_t = a \sim \rho(\cdot \mid s, a). \]

For the moment we shall leave the exact functional form of the prior on the 
reward functions and the conditional prior on the policy unspecified. Nevertheless, 
the structure allows us to state the following: 



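Lemma 1. For a prior of the form specified in (3.1), and given a controlled 
Markov process $\nu$ and observed state and action sequences $s^T, a^T$, where the 
actions are drawn from a reactive policy $\pi$, the posterior measure on reward 
functions is: 

\[ \xi(B \mid s^T, a^T, \nu) = \frac{\int_B \int_{\mathcal{P}} \pi(a^T \mid s^T) \, d\psi(\pi \mid \rho, \nu) \, d\xi(\rho \mid \nu)}{\int_{\mathcal{R}} \int_{\mathcal{P}} \pi(a^T \mid s^T) \, d\psi(\pi \mid \rho, \nu) \, d\xi(\rho \mid \nu)}, \tag{3.2} \]

where $\pi(a^T \mid s^T) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$. 

Proof. Conditioning on the observations $s^T, a^T$ via Bayes' theorem, we obtain 
the conditional measure: 

\[ \xi(B \mid s^T, a^T, \nu) = \frac{\int_B \psi(s^T, a^T \mid \rho, \nu) \, d\xi(\rho \mid \nu)}{\int_{\mathcal{R}} \psi(s^T, a^T \mid \rho, \nu) \, d\xi(\rho \mid \nu)}, \tag{3.3} \]

where $\psi(s^T, a^T \mid \rho, \nu) \triangleq \int_{\mathcal{P}} \mathbb{P}_{\nu,\pi}(s^T, a^T) \, d\psi(\pi \mid \rho, \nu)$ is a marginal likelihood 
term. It is easy to see via induction that: 

\[ \mathbb{P}_{\nu,\pi}(s^T, a^T) = \prod_{t=1}^{T} \pi(a_t \mid s_t) \, \tau(s_t \mid a_{t-1}, s_{t-1}), \tag{3.4} \]

where $\tau(s_1 \mid a_0, s_0) = \tau(s_1)$ is the initial state distribution. Thus, the reward 
function posterior is proportional to: 

\[ \int_B \int_{\mathcal{P}} \prod_{t=1}^{T} \pi(a_t \mid s_t) \, \tau(s_t \mid a_{t-1}, s_{t-1}) \, d\psi(\pi \mid \rho, \nu) \, d\xi(\rho \mid \nu). \]

Note that the $\tau(s_t \mid a_{t-1}, s_{t-1})$ terms can be taken out of the integral. Since they 
also appear in the denominator, the state transition terms cancel out. □ 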
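
The cancellation of the transition terms can be checked numerically. In the hedged sketch below (all names and numbers are illustrative assumptions), the log-likelihood ratio of two candidate policies is the same whether or not the transition and initial-state terms of (3.4) are included, so only the action probabilities matter for inference.

```python
import math


def action_log_lik(policy, states, actions):
    """Sum of log pi(a_t | s_t) over the demonstration."""
    return sum(math.log(policy[s][a]) for s, a in zip(states, actions))


def trajectory_log_lik(policy, transition, init, states, actions):
    """Full log-probability of (states, actions), including transition terms."""
    total = math.log(init[states[0]]) + action_log_lik(policy, states, actions)
    for t in range(1, len(states)):
        total += math.log(transition[states[t - 1]][actions[t - 1]][states[t]])
    return total


# Hypothetical two-state, two-action environment and demonstration.
transition = [[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]]
init = [1.0, 0.0]
states, actions = [0, 1, 1], [1, 0, 1]
policy_a = [[0.5, 0.5], [0.5, 0.5]]
policy_b = [[0.2, 0.8], [0.7, 0.3]]

full_diff = (trajectory_log_lik(policy_a, transition, init, states, actions)
             - trajectory_log_lik(policy_b, transition, init, states, actions))
action_diff = (action_log_lik(policy_a, states, actions)
               - action_log_lik(policy_b, states, actions))
```

Here `full_diff` and `action_diff` agree up to floating-point error, mirroring the final step of the proof.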
4 Estimation 

While it is entirely possible to assume that the agent's policy is optimal with 
respect to its utility (as is done for example in [1]), our analysis can be made 
more interesting by assuming otherwise. One simple idea is to restrict the policy 
space to stationary soft-max policies: 

\[ \pi_\eta(a_t \mid s_t) = \frac{\exp(\eta Q^*_\mu(s_t, a_t))}{\sum_a \exp(\eta Q^*_\mu(s_t, a))}, \tag{4.1} \]

where we assumed a finite action set for simplicity. Then we can define a prior on 
policies, given a reward function, by specifying a prior on the inverse temperature 
$\eta$, such that given the reward function and $\eta$, the policy is uniquely determined.^ 

For the chosen prior (4.1), inference can be performed using standard Markov 
chain Monte Carlo (MCMC) methods [S]. If we can estimate the reward function 
well enough, we may be able to obtain policies that surpass the performance of 
the original policy $\pi_a$ with respect to the agent's reward function $\rho_a$. 
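
A soft-max policy of this kind is straightforward to compute from a table of Q-values. The following is a minimal sketch, assuming Q-values stored as a list of rows (one per state); the function name and the example values are hypothetical.

```python
import math


def softmax_policy(q, eta):
    """Return pi[s][a] proportional to exp(eta * q[s][a]) for eta >= 0."""
    policy = []
    for row in q:
        m = max(row)  # subtract the maximum for numerical stability
        weights = [math.exp(eta * (x - m)) for x in row]
        z = sum(weights)
        policy.append([w / z for w in weights])
    return policy


q = [[0.5, 1.0], [2.0, 1.5]]          # hypothetical optimal Q-values
uniform = softmax_policy(q, 0.0)      # eta = 0 ignores the values entirely
near_greedy = softmax_policy(q, 50.0) # large eta concentrates on the best action
```

The inverse temperature thus interpolates between a uniformly random agent and a nearly optimal one, which is why a prior on it induces a prior on policies.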



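Algorithm 1 MH: Direct Metropolis-Hastings sampling from the joint distribution 
$\phi(\pi, \rho \mid a^T, s^T)$. 

1: for $k = 1, \ldots$ do 

2: $\tilde{\rho} \sim \xi(\rho \mid \nu)$. 

3: $\tilde{\eta} \sim \mathrm{Gamma}(\zeta, \theta)$ 

4: $\tilde{\pi} = \mathrm{Softmax}(\tilde{\rho}, \tilde{\eta}, \tau)$ 

5: $\tilde{p} = \mathbb{P}_{\nu,\tilde{\pi}}(s^T, a^T) / [\xi(\tilde{\rho} \mid \nu) f_{\mathrm{Gamma}}(\tilde{\eta}; \zeta, \theta)]$. 

6: w.p. $\min\{1, \tilde{p}/p^{(k-1)}\}$ do 

7: $\pi^{(k)} = \tilde{\pi}$, $\eta^{(k)} = \tilde{\eta}$, $\rho^{(k)} = \tilde{\rho}$, $p^{(k)} = \tilde{p}$. 

8: else 

9: $\pi^{(k)} = \pi^{(k-1)}$, $\eta^{(k)} = \eta^{(k-1)}$, $\rho^{(k)} = \rho^{(k-1)}$, $p^{(k)} = p^{(k-1)}$. 

10: done 

11: end for 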
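
To make the structure of such a sampler concrete, here is a small, self-contained Python sketch of an independence Metropolis-Hastings sampler for a finite MDP. It is not the implementation used in the experiments: it assumes a product-Beta(1, 1) prior on expected rewards and a Gamma(1, 1) prior on the inverse temperature, draws proposals from these priors, and therefore accepts using the likelihood ratio alone, which may differ in detail from step 5 of the listing. All function names are hypothetical.

```python
import math
import random


def q_star(transition, reward, gamma, sweeps=200):
    """Value iteration for Q* on a finite MDP (requires gamma < 1)."""
    n_s, n_a = len(reward), len(reward[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        v = [max(row) for row in q]
        q = [[reward[s][a] + gamma * sum(p * v[j] for j, p in enumerate(transition[s][a]))
              for a in range(n_a)] for s in range(n_s)]
    return q


def log_lik(q, eta, states, actions):
    """Log-probability of the observed actions under a soft-max policy."""
    total = 0.0
    for s, a in zip(states, actions):
        m = max(q[s])
        log_z = eta * m + math.log(sum(math.exp(eta * (x - m)) for x in q[s]))
        total += eta * q[s][a] - log_z
    return total


def mh_sampler(transition, gamma, states, actions, n_samples, rng):
    """Independence MH with proposals drawn from the (assumed) priors."""
    n_s, n_a = len(transition), len(transition[0])
    samples, current, current_ll = [], None, -math.inf
    for _ in range(n_samples):
        rho = [[rng.betavariate(1.0, 1.0) for _ in range(n_a)] for _ in range(n_s)]
        eta = rng.gammavariate(1.0, 1.0)
        ll = log_lik(q_star(transition, rho, gamma), eta, states, actions)
        # With prior proposals the acceptance ratio is the likelihood ratio.
        if current is None or rng.random() < math.exp(min(0.0, ll - current_ll)):
            current, current_ll = (rho, eta), ll
        samples.append(current)
    return samples


# Toy problem: action 0 stays, action 1 switches; the agent moves to state 1 and stays.
transition = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]]
samples = mh_sampler(transition, 0.9, [0, 1, 1, 1], [1, 0, 0, 0], 200, random.Random(1))
```

Each element of `samples` is a pair (reward table, inverse temperature); averaging the reward tables after a burn-in period gives a simple point estimate from which a greedy policy can be computed.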



4.1 The basic model: A Metropolis-Hastings procedure 

Estimation in the basic model (Fig. 1(a)) can be performed via a Metropolis-Hastings 
(MH) procedure. Recall that performing MH to sample from some 
distribution with density $f(x)$ using a proposal distribution with conditional 
density $g(\tilde{x} \mid x)$ has the form: 

\[ x^{(k+1)} = \begin{cases} \tilde{x}, & \text{w.p. } \min\left\{1, \dfrac{f(\tilde{x}) / g(\tilde{x} \mid x^{(k)})}{f(x^{(k)}) / g(x^{(k)} \mid \tilde{x})}\right\}, \\ x^{(k)}, & \text{otherwise.} \end{cases} \]

^ Our framework's generality allows any functional form relating the agent's preferences 
and policies. As an example, we could define a prior distribution over the 
$\epsilon$-optimality of the chosen policy, without limiting ourselves to soft-max forms. This 
would of course change the details of the estimation procedure. 
In our case, $x = (\rho, \pi)$ and $f(x) = \phi(\rho, \pi \mid s^T, a^T, \nu)$.^ We use independent 
proposals $g(x) = \phi(\rho, \pi \mid \nu)$. As $\phi(\rho, \pi \mid s^T, a^T, \nu) = \phi(s^T, a^T \mid \rho, \pi, \nu) \phi(\rho, \pi) / \phi(s^T, a^T)$, 
it follows that: 

\[ \frac{\phi(\tilde{\rho}, \tilde{\pi} \mid s^T, a^T, \nu)}{\phi(\rho^{(k)}, \pi^{(k)} \mid s^T, a^T, \nu)} = \frac{\mathbb{P}_{\nu,\tilde{\pi}}(s^T, a^T) \, \phi(\tilde{\rho}, \tilde{\pi} \mid \nu)}{\mathbb{P}_{\nu,\pi^{(k)}}(s^T, a^T) \, \phi(\rho^{(k)}, \pi^{(k)} \mid \nu)}. \]

This gives rise to the sampling procedure described in Alg. 1, which uses a 
gamma prior for the inverse temperature. 

4.2 The augmented model: A hybrid Gibbs procedure 

The augmented model (Fig. 1(b)) enables an alternative: a two-stage hybrid 
Gibbs sampler, described in Alg. 2. This conditions alternately on a reward 
sequence sample $r^T_{(k)}$ and on a reward function sample $\rho^{(k)}$ at the $k$-th iteration 
of the chain. Thus, we also obtain a posterior distribution on reward sequences. 

This sampler is of particular utility when the reward function prior is conjugate 
to the reward distribution, in which case: (i) the reward sequence sample 
can be easily obtained and (ii) the reward function prior can be conditioned on 
the reward sequence with a simple sufficient statistic. While sampling from the 
reward function posterior continues to require MH, the resulting hybrid Gibbs 
sampler remains a valid procedure [S], which may give better results than 
specifying arbitrary proposals for pure MH sampling. 

As previously mentioned, the Gibbs procedure also results in a distribution 
over the reward sequences observed by the agent. On the one hand, this could 
be valuable in applications where the reward sequence is the main quantity 
of interest. On the other hand, this has the disadvantage of making a strong 
assumption about the distribution from which rewards are drawn. 



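Algorithm 2 G-MH: Two-stage Gibbs sampler with an MH step 

1: for $k = 1, \ldots$ do 

2: $\tilde{\rho} \sim \xi(\rho \mid r^T_{(k-1)}, \nu)$. 

3: $\tilde{\eta} \sim \mathrm{Gamma}(\zeta, \theta)$ 

4: $\tilde{\pi} = \mathrm{Softmax}(\tilde{\rho}, \tilde{\eta}, \tau)$ 

5: $\tilde{p} = \mathbb{P}_{\nu,\tilde{\pi}}(s^T, a^T) / [\xi(\tilde{\rho} \mid \nu) f_{\mathrm{Gamma}}(\tilde{\eta}; \zeta, \theta)]$. 

6: w.p. $\min\{1, \tilde{p}/p^{(k-1)}\}$ do 

7: $\pi^{(k)} = \tilde{\pi}$, $\eta^{(k)} = \tilde{\eta}$, $\rho^{(k)} = \tilde{\rho}$, $p^{(k)} = \tilde{p}$. 

8: else 

9: $\pi^{(k)} = \pi^{(k-1)}$, $\eta^{(k)} = \eta^{(k-1)}$, $\rho^{(k)} = \rho^{(k-1)}$, $p^{(k)} = p^{(k-1)}$. 

10: done 

11: $r^T_{(k)} \mid s^T, a^T \sim \rho^{(k)}(s^T, a^T)$ 

12: end for 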
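
When rewards are Bernoulli and the reward function prior is a product of Beta distributions, the two conditioning steps that distinguish this sampler (steps 2 and 11) take a simple form. The sketch below shows them in isolation under that assumption; the helper names are hypothetical and the MH accept/reject step is omitted.

```python
import random


def sample_rewards(rho, states, actions, rng):
    """Step 11 (sketch): draw r_t ~ Bernoulli(rho[s_t][a_t]) along the demonstration."""
    return [1 if rng.random() < rho[s][a] else 0 for s, a in zip(states, actions)]


def beta_posterior_params(states, actions, rewards, n_states, n_actions,
                          alpha=1.0, beta=1.0):
    """Conjugate update: Beta(alpha + successes, beta + failures) per state-action pair."""
    params = [[[alpha, beta] for _ in range(n_actions)] for _ in range(n_states)]
    for s, a, r in zip(states, actions, rewards):
        params[s][a][0] += r
        params[s][a][1] += 1 - r
    return params


def sample_reward_function(params, rng):
    """Step 2 (sketch): propose a reward function from the conditioned prior."""
    return [[rng.betavariate(a, b) for a, b in row] for row in params]


states, actions, rewards = [0, 1, 1], [1, 0, 0], [0, 1, 1]
params = beta_posterior_params(states, actions, rewards, n_states=2, n_actions=2)
rng = random.Random(0)
proposal = sample_reward_function(params, rng)
new_rewards = sample_rewards(proposal, states, actions, rng)
```

The reward counts per state-action pair are the sufficient statistic mentioned above, which is what makes the conditioning closed-form.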



^ Here we abuse notation, using $\phi(\rho, \pi \mid \cdot)$ to denote the density or probability function 
with respect to a Lebesgue or counting measure associated with the probability 
measure $\phi(B \mid \cdot)$ on subsets of $\mathcal{R} \times \mathcal{P}$. 
5 Related work 

5.1 Preference elicitation in user modelling 

Preference elicitation has attracted a lot of attention in the field of user modelling 
and online advertising, where two main problems exist. The first is how to 
model the (uncertain) preferences of a large number of users. The second is the 
problem of optimal experiment design (see [7], ch. 14) to maximise the expected 
value of information through queries. Some recent models include: Braziunas and 
Boutilier [3], who introduced modelling of generalised additive utilities; Chu and 
Ghahramani [6], who proposed a Gaussian process prior over preferences, given a 
set of instances and pairwise relations, with applications to multiclass classification; 
Bonilla et al. [2], who generalised it to multiple users; and [13], which proposed 
an additively decomposable multi-attribute utility model. Experimental design 
is usually performed by approximating the intractable optimal solution [3, 7]. 

5.2 Inverse reinforcement learning 

As discussed in the introduction, the problems of inverse reinforcement learning 
and apprenticeship learning involve an agent acting in a dynamic environment. 
This makes the modelling problem different to that of user modelling, where 
preferences are between static choices. Secondly, the goal is not only to determine 
the preferences of the agent, but also to find a policy that would be at least as 
good as that of the agent with respect to the agent's own preferences.^ Finally, the 
problem of experiment design does not necessarily arise, as we do not assume that 
we have an influence over the agent's environment. 

Linear programming One interesting solution proposed by [14] is to use a 
linear program in order to find a reward function that maximises the gap between 
the best and second best action. Although elegant, this approach suffers from 
some drawbacks. (a) A good estimate of the optimal policy must be given. This 
may be hard in cases where the demonstrating agent does not visit all of the 
states frequently. (b) In some pathological MDPs, there is no such gap. For 
example, it could be that for any action $a$, there exists some other action $a'$ with 
equal value in every state. 

Policy walk Our framework can be seen as a generalisation of the Bayesian 
approach considered in [18], which does not employ a structured prior on the 
rewards and policies. In fact, they implicitly define the joint posterior over rewards 
and policies as: 

\[ \xi(\pi, \rho \mid s^T, a^T) = \frac{\exp\left[\eta \sum_t Q^\pi_\rho(s_t, a_t)\right] \xi(\rho \mid \nu)}{\xi(s^T, a^T \mid \nu)}, \]

^ Interestingly, this can also be seen as the goal of preference elicitation when applied 
to multiclass classification [see [6] for example]. 
which implies that the exponential term corresponds to $\xi(s^T, a^T, \pi \mid \rho)$. This 
ad hoc choice is probably the weakest point in this approach.* Rearranging, we 
write the denominator as: 

\[ \xi(s^T, a^T \mid \nu) = \int_{\mathcal{R} \times \mathcal{P}} \xi(s^T, a^T \mid \pi, \rho, \nu) \, d\xi(\rho, \pi \mid \nu), \tag{5.1} \]

which is still not computable, but we can employ a Metropolis-Hastings step 
using $\xi(\rho \mid \nu)$ as a proposal distribution, and an acceptance probability of: 

\[ \frac{\xi(\pi, \rho \mid s^T, a^T) / \xi(\rho)}{\xi(\pi', \rho' \mid s^T, a^T) / \xi(\rho')} = \frac{\exp\left[\eta \sum_t Q^\pi_\rho(s_t, a_t)\right]}{\exp\left[\eta \sum_t Q^{\pi'}_{\rho'}(s_t, a_t)\right]}. \]

We note that in [15], the authors employ a different sampling procedure than 
a straightforward MH, called a policy grid walk. In exploratory experiments, 
where we examined the performance of the authors' original method [17], we 
have determined that MH is sufficient and that the most crucial factor for this 
particular method was its initialisation: as will also be seen in Sec. 6, we only 
obtained a small, but consistent, improvement upon the initial reward function. 



The maximum entropy approach. A maximum entropy approach is reported 
in [21]. Given a feature function $\Phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$, and a set of trajectories 
$\{ s^{T_k}_{(k)}, a^{T_k}_{(k)} : k = 1, \ldots, n \}$, they obtain features $\Phi^{T_k}_{(k)} = (\Phi(s_{i,(k)}, a_{i,(k)}))_{i=1}^{T_k}$. 
They show that given empirical constraints $\mathbb{E}_{\theta,\nu} \Phi^{T_k} = \hat{\mathbb{E}} \Phi^{T_k}$, where 
$\hat{\mathbb{E}} \Phi^{T_k} = \frac{1}{n} \sum_{k=1}^{n} \Phi^{T_k}_{(k)}$ is the empirical feature expectation, one can obtain a maximum 
entropy distribution for actions of the form $\mathbb{P}_\theta(a_t \mid s_t) \propto e^{\theta' \Phi(s_t, a_t)}$. If $\Phi$ is the 
identity, then $\theta$ can be seen as a scaled state-action value function. 

In general, maximum entropy approaches have good minimax guarantees [12]. 
Consequently, the estimated policy is guaranteed to be close to the agent's. 
However, at best, by bounding the error in the policy, one obtains a two-sided 
high probability bound on the relative loss. Thus, one is almost certain to perform 
neither much better, nor much worse than the demonstrator. 



Game theoretic approach An interesting game theoretic approach was suggested 
by [20] for apprenticeship learning. This also only requires statistics of 
observed features, similarly to the maximum entropy approach. The main idea is 
to find the solution to a game matrix with a number of rows equal to the number 
of possible policies, which, although large, can be solved efficiently by an 
exponential weighting algorithm. The method is particularly notable for being 
(as far as we are aware) the only one with a high-probability upper bound on 
the loss relative to the demonstrating agent and no corresponding lower bound. 

* Although, as mentioned in [18], such a choice could be justifiable through a max- 
imum entropy argument, we note that the maximum-entropy based approach re- 
ported in [22] does not employ the value function in that way. 
Thus, this method may in principle lead to a significant improvement over the 
demonstrator. Unfortunately, as far as we are aware, sufficient conditions for 
this to occur are not known at the moment. In more recent work [21], the authors 
have also made an interesting link between the error of a classifier trying 
to imitate the expert's behaviour and the performance of the imitating policy, 
when the demonstrator is nearly optimal. 
Fig. 2. Total loss $\ell$ with respect to the optimal policy, as a function of the inverse 
temperature $\eta$ of the softmax policy of the demonstrator for (a) the Random 
MDP and (b) the Random Maze tasks, averaged over 100 runs. The shaded areas 
indicate the 80% percentile region, while the error bars indicate the standard error. 



6 Experiments 
6.1 Domains 

We compare the proposed algorithms on two different domains, namely random 
MDPs and random maze tasks. The Random MDP task is a discrete-state 
MDP, with four actions, such that each leads to a different, but possibly overlapping, 
quarter of the state set.^ The reward function is drawn from a Beta-product 
hyperprior with parameters $\alpha_i$ and $\beta_i$, where the index $i$ is over all state-action 
^ The transition matrix of the MDPs was chosen so that the MDP was communicating 
(c.f. [16]) and so that each individual action from any state results in a transition 
to approximately a quarter of all available states (with the destination states' 
arrival probabilities being uniformly selected and the non-destination states' arrival 
probabilities being set to zero). 
pairs. This defines a distribution over the parameters $p_i$ of the Bernoulli 
distribution determining the probability of the agent obtaining a reward when 
carrying out an action $a$ in a particular state $s$. 

For the Random Maze tasks we constructed planar grid mazes of different 
sizes, with four actions at each state, in which the agent has a probability of 0.7 
to succeed with the current action and is otherwise moved to one of the adjacent 
states randomly. These mazes are also randomly generated, with the reward 
function being drawn from the same prior. The maze structure is sampled by 
randomly filling a grid with walls through a product-Bernoulli distribution with 
parameter 1/4, and then rejecting any mazes with a number of obstacles higher 
than $|\mathcal{S}|/4$. 
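
The maze sampling step can be sketched as a simple rejection sampler. The code below is illustrative only: it assumes that the rejection threshold is counted against the number of grid cells, and it does not check that the free cells are connected; the function name is hypothetical.

```python
import random


def sample_maze(width, height, rng, wall_prob=0.25):
    """Fill a grid with independent Bernoulli(wall_prob) walls, rejecting dense mazes."""
    n_cells = width * height
    while True:
        walls = [[rng.random() < wall_prob for _ in range(width)]
                 for _ in range(height)]
        n_walls = sum(sum(row) for row in walls)
        # Reject mazes with more than a quarter of the cells blocked (assumed threshold).
        if n_walls <= n_cells / 4:
            return walls


maze = sample_maze(5, 5, random.Random(0))
n_obstacles = sum(sum(row) for row in maze)
```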

6.2 Algorithms, priors and parameters 

We compared our methodology, using the basic (MH) and the augmented (G-MH) 
model, to three previous approaches: the linear programming (LP) based 
approach [14], the game-theoretic approach (MWAL) [20] and finally, the Bayesian 
inverse reinforcement learning method (PW) suggested in [15]. In all cases, each 
demonstration was a $T$-long trajectory $s^T, a^T$, provided by a demonstrator 
employing a softmax policy with respect to the optimal value function. 

All algorithms have some parameters that must be selected. Since our methodology 
employs MCMC, the sampling parameters must be chosen so that convergence 
is ensured. We found that $10^4$ samples from the chain were sufficient, for 
both the MH and hybrid Gibbs (G-MH) sampler, with 2000 steps used as burn-in, 
for both tasks. In both cases, we used a gamma prior $\mathrm{Gamma}(1, 1)$ for the 
inverse temperature parameter $\eta$ and a product-beta prior $\mathrm{Beta}^{|\mathcal{S}||\mathcal{A}|}(1, 1)$ for the 
reward function. Since the beta is conjugate to the Bernoulli, this is what we 
used for the reward sequence sampling in the G-MH sampler. Accordingly, the 
conditioning performed in step 11 of G-MH is closed-form. 

For PW, we used an MH sampler seeded with the solution found by [14], as 
suggested by [17] and by our own preliminary experiments. Other initialisations, 
such as sampling from the prior, generally produced worse results. In addition, 
we did not find any improvement by discretising the sampling space. We also 
verified that the same number of samples used in our case was also sufficient for 
this method to converge. 

The linear-programming (LP) based inverse reinforcement learning algorithm 
by Ng and Russell [14] requires the actual agent policy as input. For 
the random MDP domain, we used the maximum likelihood estimate. For the 
maze domain, we used a Laplace-smoothed estimate (a product-Dirichlet prior 
with parameters equal to 1) instead, as this was more stable. 
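
Both policy estimates can be obtained from simple action counts. The following sketch (hypothetical function name; the uniform fallback for unvisited states under the maximum likelihood estimate is an assumption) returns the maximum likelihood estimate when `pseudo_count` is 0 and the Laplace-smoothed estimate when it is 1.

```python
def estimate_policy(states, actions, n_states, n_actions, pseudo_count=0.0):
    """Estimate pi(a | s) from a demonstration using (smoothed) action counts."""
    counts = [[pseudo_count] * n_actions for _ in range(n_states)]
    for s, a in zip(states, actions):
        counts[s][a] += 1.0
    policy = []
    for row in counts:
        total = sum(row)
        if total > 0:
            policy.append([c / total for c in row])
        else:
            # Unvisited state with no pseudo-counts: fall back to uniform (assumption).
            policy.append([1.0 / n_actions] * n_actions)
    return policy


states, actions = [0, 0, 1], [0, 1, 1]
ml = estimate_policy(states, actions, n_states=3, n_actions=2)
laplace = estimate_policy(states, actions, n_states=3, n_actions=2, pseudo_count=1.0)
```

The smoothed estimate never assigns zero probability to an action, which is one plausible reason for it being more stable when many states are rarely visited.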

Finally, we examined the MWAL algorithm of Syed and Schapire [20] . This 
requires the cumulative discounted feature expectation as input, for appropri- 
ately defined features. Since we had discrete environments, we used the state oc- 
cupancy as a feature. The feature expectations can be calculated empirically, but 
we obtained better performance by first computing the transition probabilities 
of the Markov chain induced by the maximum likelihood (or Laplace-smoothed) 
policy and then calculating the expectation of these features given this chain. 
We set all accuracy parameters of this algorithm to 10"'^, which was sufficient 
for a robust behaviour. 
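
With state occupancy as the feature, the expectation under the induced Markov chain can be computed by propagating the state distribution forward. The sketch below is a hedged illustration with hypothetical names, assuming a known initial state distribution `init` and a truncated horizon.

```python
def discounted_occupancy(transition, policy, init, gamma, horizon=200):
    """Discounted expected state visitation counts under a fixed policy."""
    n_s = len(transition)
    # Transition matrix of the Markov chain induced by the policy.
    chain = [[sum(policy[s][a] * transition[s][a][j] for a in range(len(policy[s])))
              for j in range(n_s)] for s in range(n_s)]
    occupancy, dist, weight = [0.0] * n_s, list(init), 1.0
    for _ in range(horizon):
        occupancy = [o + weight * d for o, d in zip(occupancy, dist)]
        dist = [sum(dist[s] * chain[s][j] for s in range(n_s)) for j in range(n_s)]
        weight *= gamma
    return occupancy


# Action 0 stays, action 1 switches; the policy always switches.
transition = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]]
policy = [[0.0, 1.0], [0.0, 1.0]]
occupancy = discounted_occupancy(transition, policy, init=[1.0, 0.0], gamma=0.5)
```

In this example the chain alternates between the two states, so state 0 accumulates 1 + 0.25 + ... = 4/3 and state 1 accumulates 0.5 + 0.125 + ... = 2/3.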



6.3 Performance measure 

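In order to measure performance, we plot the $L_1$ loss^ of the value function of 
each policy relative to the optimal policy with respect to the agent's utility: 

\[ \ell(\pi) \triangleq \sum_{s \in \mathcal{S}} \left[ V^*_\mu(s) - V^\pi_\mu(s) \right], \tag{6.1} \]

where $V^*_\mu(s) \triangleq \max_a Q^*_\mu(s, a)$ and $V^\pi_\mu(s) \triangleq \mathbb{E}_\pi Q^\pi_\mu(s, a)$. 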
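
For a small finite MDP this loss can be computed directly by evaluating both value functions. The sketch below is a minimal illustration with hypothetical names, assuming expected rewards `reward[s][a]`, a discount factor below one and the standard Bellman recursions.

```python
def backup(transition, reward, gamma, v, s, a):
    """One-step lookahead value of taking action a in state s."""
    return reward[s][a] + gamma * sum(p * v[j] for j, p in enumerate(transition[s][a]))


def optimal_value(transition, reward, gamma, sweeps=500):
    v = [0.0] * len(reward)
    for _ in range(sweeps):
        v = [max(backup(transition, reward, gamma, v, s, a)
                 for a in range(len(reward[s]))) for s in range(len(reward))]
    return v


def policy_value(transition, reward, gamma, policy, sweeps=500):
    v = [0.0] * len(reward)
    for _ in range(sweeps):
        v = [sum(policy[s][a] * backup(transition, reward, gamma, v, s, a)
                 for a in range(len(reward[s]))) for s in range(len(reward))]
    return v


def l1_loss(v_star, v_pi):
    """Sum over states of the gap between optimal and achieved value."""
    return sum(vs - vp for vs, vp in zip(v_star, v_pi))


# Action 0 stays, action 1 switches; any action taken in state 1 pays reward 1.
transition = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]]
reward = [[0.0, 0.0], [1.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = l1_loss(optimal_value(transition, reward, 0.5),
               policy_value(transition, reward, 0.5, uniform))
```

Here the optimal values are (1, 2) and the uniform policy achieves (0.5, 1.5), so the loss equals 1.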

In all cases, we average over 100 experiments on an equal number of randomly 
generated environments $\mu_1, \mu_2, \ldots$. For the $i$-th experiment, we generate 
a $T$-step-long demonstration $D_i = (s^T, a^T)$ via an agent employing a softmax 
policy. The same demonstration is used across all methods to reduce variance. 
In addition to the empirical mean of the loss, we use shaded regions to show the 
80% percentile region across trials and error bars to display the standard error. 




(a) Effect of amount of data (b) Effect of environment size 

Fig. 3. Total loss $\ell$ with respect to the optimal policy, in the Random MDP 
task. Figure 3(a) shows how performance improves as a function of the length 
$T$ of the demonstrated sequence. Figure 3(b) shows the effect of the number 
of states $|\mathcal{S}|$ of the underlying MDP. All quantities are averaged over 100 runs. 
The shaded areas indicate the 80% percentile region, while the error bars indicate the 
standard error. 



^ This loss can be seen as a scaled version of the expected loss under a uniform state 
distribution and is a bound on the $L_\infty$ loss. The other natural choice, the optimal 
policy's stationary state distribution, is problematic for non-ergodic MDPs. 
6.4 Results 



We consider the loss of six different policies. The first, soft, is the policy of 
the demonstrating agent itself. The second, MH, is the Metropolis-Hastings 
procedure defined in Alg. 1, while G-MH is the hybrid Gibbs procedure from 
Alg. 2. We also consider the loss of our implementations of Linear Programming 
(LP), Policy Walk (PW), and MWAL, as summarised in Sec. 5. 

We first examined the loss of greedy policies,^ derived from the estimated 
reward function, as the demonstrating agent becomes greedier. Figure 2 shows 
results for the two different domains. It is easy to see that the MH sampler 
significantly outperforms the demonstrator, even when the latter is near-optimal. 
While the hybrid Gibbs sampler's performance lies between that of the demonstrator 
and the MH sampler, it also estimates a distribution over reward sequences 
as a side-effect. Thus, it could be of further value where estimation of 
reward sequences is important. We observed that the performance of the baseline 
methods is generally inferior, though the MWAL algorithm nevertheless tracks 
the demonstrator's performance closely. 

This suboptimal performance of the baseline methods in the Random MDP 
setting cannot be attributed to poor estimation of the demonstrated policy, as 
can clearly be seen in Figure 3(a), which shows the loss of the greedy policy 
derived from each method as the amount of data increases. While the proposed 
samplers improve significantly as observations accumulate, this effect is smaller 
in the baseline methods we compared against. As a final test, we plot the relative 
loss in the Random MDP as the number of states increases in Figure 3(b). We 
can see that the relative performance of methods is invariant to the size of the 
state space for this problem. 

Overall, we observed that the basic model (MH) consistently outperforms* the 
agent in all settings. The augmented model (G-MH), while sometimes outperforming 
the demonstrator, is not as consistent. Presumably, this is due to the 
joint estimation of the reward sequence. Finally, the other methods under consideration 
on average do not improve upon the initial policy and can be, in a 
large number of cases, significantly worse. For the linear programming inverse 
RL method, perhaps this can be attributed to implicit assumptions about the 
MDP and the optimality of the given policy. For the policy walk inverse RL 
method, our belief is that its suboptimal performance is due to the very restrictive 
prior it uses. Finally, the performance of the game theoretic approach 
is slightly disappointing. Although it is much more robust than the other two 
baseline approaches, it never outperforms the demonstrator, even though technically 
this is possible. One possible explanation is that since this approach is 
worst-case by construction, it results in overly conservative policies. 



^ Experiments with non-greedy policies (not shown) produced generally worse results. 
* It was pointed out by the anonymous reviewers that the loss we used may be biased. 
Indeed, a metric defined over some other state distribution could give different 
rankings. However, after looking at the results carefully, we determined that the 
policies obtained via the MH sampler were strictly dominating. 
7 Discussion 

We introduced a unified framework of preference elicitation and inverse reinforcement 
learning, and presented two statistical inference models with two corresponding 
sampling procedures for estimation. Our framework is flexible enough to allow 
using alternative priors on the form of the policy and of the agent's preferences, 
although that would require adjusting the sampling procedures. In experiments, 
we showed that for a particular choice of policy prior, closely corresponding to 
previous approaches, our samplers can outperform not only other well-known 
inverse reinforcement learning algorithms, but the demonstrating agent as well. 

The simplest extension, which we have already alluded to, is the estimation of 
the discount factor, for which we have obtained promising results in preliminary 
experiments. A slightly harder generalisation occurs when the environment is 
not known to us. This is not due to difficulties in inference, since in many cases 
a posterior distribution over $\mathcal{M}$ is not hard to maintain (see for example pillSj). 
However, computing the optimal policy given a belief over MDPs is harder [S], 
even if we limit ourselves to stationary policies [11]. We would also like to consider 
more types of preference and policy priors. Firstly, the use of spatial priors for the 
reward function, which would be necessary for large or continuous environments. 
Secondly, the use of alternative priors on the demonstrator's policy. 

The generality of the framework allows us to formulate different preference 
elicitation problems than those directly tied to reinforcement learning. For ex- 
ample, it is possible to estimate utilities that are not additive functions of some 
latent rewards. This does not appear to be easily achievable through the exten- 
sion of other inverse reinforcement learning algorithms. It would be interesting 
to examine this in future work. 

Another promising direction, which we have already investigated to some 
degree [8], is to extend the framework to a fully hierarchical model, with a 
hyperprior on reward functions. This would be particularly useful for modelling 
a population of agents. Consequently, it would have direct applications to the 
statistical analysis of behavioural experiments. 

Finally, although in this paper we have not considered the problem of experimental 
design for preference elicitation (i.e. active preference elicitation), we 
believe it is a very interesting direction. In addition, it has many applications, such 
as online advertising and the automated optimal design of behavioural experiments. 
It is our opinion that a more effective preference elicitation procedure, 
such as the one presented in this paper, is essential for the complex planning task 
that experimental design is. Consequently, we hope that researchers in that area 
will find our methods useful. 